# Multimodal Image-Text Understanding

Open-Qwen2VL is a multimodal model that takes both images and text as input and generates text output.

Image-to-Text, English

Qwen.qwen2.5 VL 3B Instruct GGUF

Qwen2.5-VL-3B-Instruct is a 3B-parameter vision-language model that supports image-to-text generation tasks.

Qwen.qwen2.5 VL 7B Instruct GGUF

Qwen2.5-VL-7B-Instruct is a 7B-parameter vision-language model that supports joint image-text understanding and text generation tasks.

Qwen2.5 VL 3B Instruct GPTQ Int3

A GPTQ-Int3 quantized version of Qwen2.5-VL-3B-Instruct, suitable for multimodal image-text processing tasks, with reduced VRAM usage and faster inference.

Transformers, Supports Multiple Languages

Qwen2.5 VL 7B Instruct GPTQ Int3

An unofficial GPTQ-Int3 quantized version of the Qwen2.5-VL-7B-Instruct model, suitable for multimodal image-text-to-text tasks.

Transformers, Supports Multiple Languages

Paligemma2 3b Mix 224 Jax

PaliGemma 2 is an upgraded vision-language model based on Gemma 2. It supports multilingual image and text input with text output and is designed specifically for vision-language tasks.

Paligemma2 10b Mix 448

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text inputs to generate text outputs, suitable for various vision-language tasks.

LLaVA-1.6 is an open-source vision-language model that supports image-text-to-text tasks, with improved visual understanding and text generation capabilities.

Image Caption Large Copy

BLIP is a vision-language pretraining model that excels at image captioning. It makes effective use of noisy web data by bootstrapping captions: synthetic captions are generated and noisy ones are filtered out.


© 2025 AIbase